Domain Ontology Guided Feature-Selection for Document Categorization

Authors

  • Bill B. Wang
  • R I. McKay
  • Hussein A. Abbass
  • Michael Barlow
Abstract

We present a novel method that uses the hierarchical structure of a domain ontology to select the features that represent documents. All raw words in the training documents are mapped to concepts in a domain ontology. From these concepts, a concept hierarchy is built for the training document space using the is-a relationships defined in the domain ontology. An optimum concept set may then be obtained by searching the concept hierarchy with an appropriate heuristic function, and this set may be used as the feature space for representing the training dataset. The proposed method aims to address some drawbacks of text classification algorithms and feature selection algorithms. One major difficulty for text classification algorithms, especially machine learning approaches, is the high dimensionality of the feature space. The second major difficulty is obtaining a training dataset of good quality, which is crucial to the performance of almost all text classifiers. Experimental results show that our method addresses these problems more effectively than existing methods.
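The pipeline described in the abstract (map words to concepts, build an is-a hierarchy over the training documents, then search it for a concept set) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy ontology `IS_A`, the lexicon `WORD_TO_CONCEPT`, and in particular the document-frequency threshold `min_df` used as the search heuristic. The abstract does not specify the authors' heuristic function, so this is not their method.

```python
from collections import Counter

# Toy is-a ontology (child -> parent); purely illustrative.
IS_A = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": "entity",
    "car": "vehicle", "truck": "vehicle", "vehicle": "entity",
}

# Hypothetical word-to-concept lexicon.
WORD_TO_CONCEPT = {
    "dog": "dog", "puppy": "dog", "cat": "cat", "kitten": "cat",
    "sparrow": "sparrow", "car": "car", "sedan": "car", "truck": "truck",
}


def ancestors(concept):
    """Return the concept followed by its is-a ancestors up to the root."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def doc_concepts(tokens):
    """Map raw words to concepts and add all of their is-a ancestors."""
    found = set()
    for tok in tokens:
        concept = WORD_TO_CONCEPT.get(tok.lower())
        if concept is not None:
            found.update(ancestors(concept))
    return found


def select_features(docs, min_df=2):
    """Top-down search of the training concept hierarchy.

    Assumed heuristic: a concept is split into its children only if every
    child seen in training occurs in at least `min_df` documents; otherwise
    the concept itself is kept as a feature.
    """
    df = Counter()
    for tokens in docs:
        df.update(doc_concepts(tokens))

    # Restrict the ontology to concepts observed in the training documents.
    children = {}
    for concept in df:
        parent = IS_A.get(concept)
        if parent is not None:
            children.setdefault(parent, []).append(concept)

    selected = []
    frontier = [c for c in df if c not in IS_A]  # start from the root(s)
    while frontier:
        node = frontier.pop()
        kids = children.get(node, [])
        if kids and all(df[k] >= min_df for k in kids):
            frontier.extend(kids)
        else:
            selected.append(node)
    return sorted(selected)
```

Raising `min_df` in this sketch yields fewer, more general concepts, while lowering it yields more, finer-grained ones. That trade-off is one simple way to picture how a hierarchy search can reduce the dimensionality of the feature space.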

Similar resources

A Comparative Study for Domain Ontology Guided Feature Extraction

We introduced a novel method employing a hierarchical domain ontology structure to extract features representing documents in our previous publication (Wang 2002). All raw words in the training documents are mapped to concepts in a concept hierarchy derived from the domain ontology. Based on these concepts, a concept hierarchy is established for the training document space, using is-a relations...

Gene ontology annotation as text categorization: An empirical study

Gene Ontology (GO) consists of three structured controlled vocabularies, i.e., GO domains, developed for describing attributes of gene products, and its annotation is crucial to provide a common gateway to access different model organism databases. This paper explores an effective application of text categorization methods to this highly practical problem in biology. As a first step, we attempt...

Two New Approaches to Feature Selection for Document Categorization

Due to the huge volume of text documents available on the Internet, it is increasingly necessary to effectively manage them and then help users to retrieve what they want. Document categorization can organize documents into domain specific classes and so facilitate information retrieval. In general, most of document categorization systems are composed of three kinds of models: one for weighting...

Interactions Between Document Representation and Feature Selection in Text Categorization

Many studies in automated Text Categorization focus on the performance of classifiers, with or without considering feature selection methods, but almost as a rule taking into account just one document representation. Only relatively recently did detailed studies on the impact of various document representations step into the spotlight, showing that there may be statistically significant differe...

Iterative Ontology Selection Guided by User for Building Domain Ontologies

In this paper we present a new method for ontology selection in a reuse context. The novel feature of this method is the iterative selection of the reused ontologies. Ontology selection is guided by the user according to his requirements and his perception to the target domain. Starting from a first selected ontology, the concepts with the weakest density are identified then the ontology develo...

Publication date: 2007